Add unique prefix - increasing counter #217

MML-coder · 2025-07-01T20:54:41Z

The SyntheticTextItemsGenerator was generating prompts that could trigger vLLM's automatic prefix caching, with prefix cache hit rates reaching up to 80% in some cases during performance benchmarking.

Implemented unique prefix injection to guarantee a 0% prefix cache hit rate while maintaining realistic prompt characteristics.
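As a rough illustration of the idea (not the code in this PR), the sketch below prepends one token id taken from an increasing counter to each synthetic prompt. It assumes that the server's prefix cache keys on leading token blocks, so a different first token keeps two prompts from sharing a cached prefix. The names `unique_prefix_ids` and `build_prompt_ids` are hypothetical, and dropping one body token to keep the prompt length unchanged is an assumption made for the example.

```python
from itertools import count
from typing import Iterator


def unique_prefix_ids(vocab_size: int, start: int = 0) -> Iterator[int]:
    """Yield one token id per request from an increasing counter.

    The counter wraps at vocab_size, so ids are only unique for the
    first vocab_size requests (an assumption of this sketch).
    """
    for n in count(start):
        yield n % vocab_size


def build_prompt_ids(body_ids: list[int], prefix_iter: Iterator[int]) -> list[int]:
    """Prepend a unique token id and drop one body token to keep the length."""
    trimmed = body_ids[: max(len(body_ids) - 1, 0)]
    return [next(prefix_iter)] + trimmed


if __name__ == "__main__":
    prefixes = unique_prefix_ids(vocab_size=32_000)
    body = [101, 202, 303, 404]  # stand-in for sampled synthetic token ids
    print(build_prompt_ids(body, prefixes))
    print(build_prompt_ids(body, prefixes))
```

Two prompts built from the same body differ in their first token, which is the property the benchmark relies on; the rest of the prompt keeps its sampled content.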

Test:
Tests on the H200 target accelerator to confirm the fix are in progress.

MML-coder · 2025-07-08T18:58:55Z

I am trying to figure out the lint errors. When I ran the checks locally, they all seemed to pass. :)

ruff check --fix tests/unit/dataset/test_synthetic.py
All checks passed!

MML-coder · 2025-07-08T19:01:09Z

End to end test:

Ran the following command against an inference server running Llama:

command:
`
guidellm benchmark --target 'http://llama-4-maverick-fp8-c94dbf44-predictor.kserve-e2e-perf.svc.cluster.local:8080/v1' --model RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 --processor RedHatAI/Llama-4-Maverick-17B-128E-Instruct-FP8 --data='{"prompt_tokens":512,"prompt_tokens_stdev":128,"prompt_tokens_min":1,"prompt_tokens_max":1024,"output_tokens":2048,"output_tokens_stdev":64,"output_tokens_min":1,"output_tokens_max":4096}' --rate-type concurrent --rate "100" --warmup-percent 0.2 --max-requests 500 --output-path output.json
`

vLLM output:
INFO 07-08 17:56:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1809.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:56:54 [loggers.py:116] Engine 000: Avg prompt throughput: 121.7 tokens/s, Avg generation throughput: 1689.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.7%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:04 [loggers.py:116] Engine 000: Avg prompt throughput: 1136.3 tokens/s, Avg generation throughput: 1267.3 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:14 [loggers.py:116] Engine 000: Avg prompt throughput: 1584.5 tokens/s, Avg generation throughput: 1106.8 tokens/s, Running: 99 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:24 [loggers.py:116] Engine 000: Avg prompt throughput: 1471.5 tokens/s, Avg generation throughput: 1096.7 tokens/s, Running: 98 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:34 [loggers.py:116] Engine 000: Avg prompt throughput: 611.2 tokens/s, Avg generation throughput: 1518.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:44 [loggers.py:116] Engine 000: Avg prompt throughput: 52.7 tokens/s, Avg generation throughput: 1629.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%
INFO 07-08 17:57:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1759.5 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1769.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1789.4 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.9%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1799.9 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1839.6 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:58:54 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1829.2 tokens/s, Running: 100 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.5%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:04 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1724.9 tokens/s, Running: 92 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.4%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:14 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1309.3 tokens/s, Running: 46 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:24 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 426.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:34 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-08 17:59:44 [loggers.py:116] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
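For longer runs, the same check can be done mechanically instead of reading the log by eye. The sketch below is a hypothetical helper (`max_prefix_hit_rate` is not part of guidellm or vLLM) that extracts every "Prefix cache hit rate" value from log text in the format shown above and returns the largest one, so a result of 0.0 means no interval reported a cache hit.

```python
import re

# Matches the percentage in entries such as "Prefix cache hit rate: 0.0%".
HIT_RATE = re.compile(r"Prefix cache hit rate: (\d+(?:\.\d+)?)%")


def max_prefix_hit_rate(log_text: str) -> float:
    """Return the highest prefix cache hit rate found in the log text."""
    rates = [float(value) for value in HIT_RATE.findall(log_text)]
    if not rates:
        raise ValueError("no prefix cache hit rate entries found")
    return max(rates)


if __name__ == "__main__":
    sample = (
        "INFO 07-08 17:56:44 [loggers.py:116] Engine 000: "
        "Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 1809.3 tokens/s, Running: 100 reqs, "
        "Waiting: 0 reqs, GPU KV cache usage: 6.5%, "
        "Prefix cache hit rate: 0.0%"
    )
    print(max_prefix_hit_rate(sample))
```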

nm-red-hat-upstream-automation-bot · 2025-07-09T19:24:25Z

📦 Build Artifacts Available
The build artifacts (.whl and .tar.gz) have been successfully generated and are available for download: https://github.com/neuralmagic/guidellm/actions/runs/16178342451/artifacts/3498543434.
They will be retained for up to 30 days.

MML-coder · 2025-07-09T21:04:40Z

pre-commit run --all-files
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
run linter...............................................................Passed
run formatter............................................................Passed
mypy.....................................................................Passed

src/guidellm/dataset/synthetic.py

tests/unit/dataset/test_synthetic.py

Signed-off-by: Samuel Monson <[email protected]>

Co-authored-by: Mehul <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Signed-off-by: Samuel Monson <[email protected]>

sjmonson · 2025-08-18T20:43:52Z

Merging work into #183

MML-coder marked this pull request as ready for review July 8, 2025 18:58

MML-coder self-assigned this Jul 9, 2025

vllm-project deleted a comment from nm-red-hat-upstream-automation-bot bot Jul 9, 2025

sjmonson requested changes Jul 16, 2025

src/guidellm/dataset/synthetic.py (resolved)

src/guidellm/dataset/synthetic.py (outdated, resolved)

tests/unit/dataset/test_synthetic.py (outdated, resolved)

tests/unit/dataset/test_synthetic.py (outdated, resolved)

markurtz added this to the v0.3.0 milestone Aug 13, 2025

sjmonson assigned sjmonson and unassigned MML-coder Aug 14, 2025

sjmonson added 3 commits August 14, 2025 15:50

Add fixed prefix option to synthetic data

a5d5772

Signed-off-by: Samuel Monson <[email protected]>

Add prefix before decode

a3eed17

Signed-off-by: Samuel Monson <[email protected]>

Document prefix_tokens arg

94a4508

Signed-off-by: Samuel Monson <[email protected]>

sjmonson force-pushed the prefix_cache_invalidate branch 2 times, most recently from ca35625 to 6662be6 Compare August 14, 2025 20:36

MML-coder and others added 2 commits August 18, 2025 16:32

Add unique single-token prefix to every request

daf2f4c

Co-authored-by: Mehul <[email protected]>
Co-authored-by: Samuel Monson <[email protected]>
Signed-off-by: Samuel Monson <[email protected]>

Add unit tests

da29a71

sjmonson force-pushed the prefix_cache_invalidate branch from 6662be6 to da29a71 Compare August 18, 2025 20:38

sjmonson closed this Aug 18, 2025

sjmonson deleted the prefix_cache_invalidate branch August 18, 2025 20:44
